High-speed, efficient multipliers are essential components of today's computational circuits, such as
those used for digital signal processing, cryptographic algorithms and high-performance processors.
Almost all processing units contain hardware multipliers based on an algorithm that fits the
application requirement. Tremendous advances in VLSI technology over the past several years have
increased the need for high-speed multipliers and compelled designers to make trade-offs among speed,
power consumption and area. Among the various methods of multiplication, Vedic multipliers are
gaining ground due to their expected improvement in performance. A novel multiplier design for
high-speed VLSI applications using the Urdhva-Tiryagbhyam sutra of Vedic multiplication is presented
in this paper. The proposed architecture is modeled using Verilog HDL, simulated using Cadence NCSIM
and synthesized using Cadence RTL Compiler with a 65nm TSMC library. The proposed multiplier
architecture is compared with existing multipliers, and the results show significant improvement in
speed and power dissipation.

Keywords: Binary Multiplication, Carry Pre Computation, Multiplier Architecture, Operand
Decomposition, Vedic Multiplier

1. INTRODUCTION 

Processors are an important part of integrated circuits (ICs). A large number of functions can be
packed into an IC thanks to the tremendous growth in integration density in recent times. As the
number of functions increases, the need for computation also grows. With the advent of new process
technologies, shrinking feature sizes and the availability of modern CAD tools, the development of
complex integrated circuits for various applications is possible. Examples of such applications
include digital signal processing [1,2], mobile computation and communication, multimedia
applications and scientific computing. The speed and efficiency of the processor in such an IC are
crucial for meeting the requirements of the applications supported by the IC. The speed and
efficiency of the processor in turn depend upon the arithmetic logic unit [3], which is considered
the main computational unit of the processor.

Moreover, the multiplier units [4] are the most important hardware structures in a complex
arithmetic unit. The multiplier units are capable of performing operations on operands of various
data types, such as calculating a running sum of products. As multiplication is a crucial arithmetic
operation in processors [5] and digital computer systems, multipliers are the core building block
for many algorithms in a wide variety of computing applications. Although multipliers are the main
arithmetic components used for processing scientific data, their excessive power consumption and
delay have attracted attention from the research community. Usually, multiple arithmetic cores
working in parallel are used to process large amounts of data with relatively low power and delay.

Various algorithms have been proposed for the hardware implementation of multipliers. Add and Shift
is the common algorithm used in multiplier design [6]. In parallel multipliers, the important
parameter that determines performance is the number of partial products to be added. The Modified
Booth algorithm [7] reduces the number of partial products generated during multiplication, which in
turn increases the performance of the multiplier. The Wallace tree based algorithm reduces the
number of addition stages and is used to improve the speed of multiplication. In some
implementations, an efficient multiplier architecture is designed by combining the Modified Booth
and Wallace tree algorithms. However, increasing parallelism increases the number of shifts between
the intermediate sum and the partial products, which results in reduced speed, increased power
consumption and increased area because of the irregular structure. Thus, in some cases, low-power and
compact multiplier architectures are implemented using a serial multiplication algorithm. Serial
multipliers [8] perform better in power consumption and area at the cost of delay. Depending upon
the application, either parallel or serial multipliers are selected.

However, in high-speed processors operating at higher clock frequencies, the existing multipliers
introduce too much delay in the execution of instructions. Existing multiplier units that consume
more power are also not suitable for processors used in wireless and portable devices. Thus, power
saving is an important area for improvement.

To address low-power computation along with high performance, a new approach to multiplier design
based on ancient Vedic mathematics has been explored. Mathematical operations based on Vedic
mathematics are reported to be fast and to require less hardware. This aspect of Vedic mathematics
can be utilized to increase the computational speed of multipliers. This paper describes the design
and implementation of a Vedic multiplier based on the Urdhva-Tiryagbhyam sutra [9]-[11]. The number
of steps required to perform a multiplication using the Urdhva-Tiryagbhyam sutra is considerably
smaller than in conventional multiplication techniques. In this paper, we further explore a novel
method to enhance the speed of a Vedic multiplier by pre-computing the carries that are used during
summation of the partial products. Implementing the pre-computation logic using multiplexer based
carry look-ahead logic and XOR logic resulted in a reduction of delay. The proposed multiplier
combined with the operand decomposition technique resulted in a reduction of power consumption, which
in turn reduced the power-delay product of the multiplier.

The rest of the paper is organized as follows. The methodology and the architecture of the proposed
multipliers are given in Section 2. Results are presented in Section 3. Finally, the conclusion is
given in Section 4.


2. RESEARCH METHOD 
2.1. Carry pre-computation based binary multiplier 

An 8-bit binary Vedic multiplier is proposed with A and B as inputs and P as the final 16-bit
product. The block diagram for 8-bit multiplication is shown in Figure 1. In the proposed multiplier,
the operands A and B are each divided into higher and lower parts of 4 bits.
Figure 1. Block Diagram of 8-bit Multiplication 


In this type of multiplier, an 8-bit binary multiplication is realized using 4-bit binary Vedic
multiplication with the carry pre-computation logic shown in Figure 2, where A3, A2, A1, A0 and B3,
B2, B1, B0 are the 4-bit binary inputs and P7, P6, P5, P4, P3, P2, P1, P0 are the binary output bits.
Figure 2. Carry Pre-Computation Based Multiplier 


The architecture of the 4-bit multiplier can be understood from the block diagram shown in Figure 3. 
Figure 3. Architecture of Carry Pre-Computation Based Multiplier 


The partial product generator is the first block of the multiplier, to which the 4-bit multiplicand
and multiplier are given as inputs. The multiplication technique used at this stage is
Urdhva-Tiryagbhyam. The 4-bit multiplication results in a total of 16 partial products (pp1-pp16).
The result of multiplying one binary bit by another is either zero or one, which is simply the
logical AND of the two bits.
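
The partial product stage can be sketched in Python as follows. This is a minimal illustration, not the authors' Verilog: the row-major numbering (pp1 to pp4 formed with B0, pp5 to pp8 with B1, and so on) is an assumption inferred from the way the carry equations group the terms, and the function names are hypothetical. Grouping the AND terms by column weight reproduces the vertical-and-crosswise pattern of Urdhva-Tiryagbhyam.

```python
def partial_products(a, b):
    """Return {k: bit} for pp1..pp16 of two 4-bit operands.

    Assumed numbering (not stated in the text): pp[4*i + j + 1] = A[j] & B[i].
    """
    pp = {}
    for i in range(4):          # bit index of B
        for j in range(4):      # bit index of A
            pp[4 * i + j + 1] = ((a >> j) & 1) & ((b >> i) & 1)
    return pp


def columns(pp):
    """Group the partial products by column weight i + j (0..6)."""
    cols = {w: [] for w in range(7)}
    for k, bit in pp.items():
        i, j = divmod(k - 1, 4)
        cols[i + j].append(bit)
    return cols


def product_from_columns(a, b):
    """Add every column with its binary weight; equals a * b for 4-bit inputs."""
    cols = columns(partial_products(a, b))
    return sum(sum(bits) << w for w, bits in cols.items())
```

Under this numbering the columns hold 1, 2, 3, 4, 3, 2 and 1 terms, so the widest column (pp4, pp7, pp10, pp13) is the one whose carry logic is the most involved.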

The products AL*BL, AH*BL, AL*BH and AH*BH are determined using the above 4-bit carry
pre-computation based multiplier, and the results of all sub-multipliers are added to determine the
final product. The block diagram of the 8-bit multiplier is shown in Figure 4.
Figure 4. Block Diagram of 8-bit Multiplier Using 4-bit Carry Pre-Computation Based Multiplier 
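
The composition of the 8-bit product from four 4-bit products can be sketched as below. The text only states that the sub-multiplier results are added; the shift amounts and the single plain addition used here are assumptions that follow from the bit weights of the halves, and `mul4` is a hypothetical stand-in for the 4-bit carry pre-computation block.

```python
def mul4(x, y):
    """Stand-in for the 4-bit carry pre-computation based multiplier."""
    assert 0 <= x < 16 and 0 <= y < 16
    return x * y


def mul8(a, b):
    """Build an 8-bit x 8-bit product from four 4-bit x 4-bit products."""
    al, ah = a & 0xF, a >> 4    # lower and higher halves of A
    bl, bh = b & 0xF, b >> 4    # lower and higher halves of B

    ll = mul4(al, bl)           # weight 2^0
    hl = mul4(ah, bl)           # weight 2^4
    lh = mul4(al, bh)           # weight 2^4
    hh = mul4(ah, bh)           # weight 2^8

    # Align each sub-product to its weight and add; the result fits in 16 bits.
    return ll + ((hl + lh) << 4) + (hh << 8)
```

In hardware the three additions would be carried out by dedicated adders rather than by one arithmetic expression; the sketch only checks that the decomposition into AL, AH, BL and BH is arithmetically exact.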


The second stage in the block diagram is the carry generation circuit. Here, pre-computation logic
is integrated with the Urdhva-Tiryagbhyam multiplication technique. The carry equations are
generated separately for each column of partial products, and the inputs to these equations are
taken from the previous column. The equations for the pre-computed carries are given below.


c2 = pp5 & pp2; (3)
c3t1 = (pp6 & pp3) | (pp9 & (pp3 | pp6)); (4)
c3t2 = (pp9 & ~pp6) | (pp3 & ~pp9) | (~pp3 & pp6); (5)
c31 = c2 ? c3t2 : c3t1; (6)
c32 = pp2 & pp5 & pp3 & pp6 & pp9; (7)
c41t1 = pp13 ? ((pp10 & ~pp7) | (pp4 & ~pp10) | (~pp4 & pp7)) : ((pp7 & pp4) | (pp10 & (pp4 | pp7))); (8)
c41t2 = pp13 ? ((~pp7 & ~pp4) | (~pp10 & (~pp4 | ~pp7))) : ((~pp7 & pp4) | (pp10 & ~pp4) | (~pp10 & pp7)); (9)
c41 = c31 ? c41t2 : c41t1; (10)
c42 = ((c31 & pp13) & ((pp10 & (pp7 | pp4)) | (pp7 & pp4))) | ((pp10 & pp7 & pp4) & (c31 | pp13)); (11)
c51t1 = c32 ? ((pp14 & ~pp11) | (pp8 & ~pp14) | (~pp8 & pp11)) : ((pp11 & pp8) | (pp14 & (pp8 | pp11))); (12)
c51t2 = c32 ? ((~pp11 & ~pp8) | (~pp14 & (~pp8 | ~pp11))) : ((~pp11 & pp8) | (pp14 & ~pp8) | (~pp14 & pp11)); (13)
c51 = c41 ? c51t2 : c51t1; (14)
c52 = ((c41 & c32) & ((pp14 & (pp11 | pp8)) | (pp11 & pp8))) | ((pp14 & pp11 & pp8) & (c41 | c32)); (15)
c6t1 = (pp12 & pp15) | (c42 & (pp12 | pp15)); (16)
c6t2 = (c42 & ~pp15) | (pp12 & ~c42) | (pp15 & ~pp12); (17)
c61 = c51 ? c6t2 : c6t1; (18)
c62 = c51 & c42 & pp12 & pp15; (19)
c71 = (c52 & pp16) | (c61 & (c52 | pp16)); (20)


The third stage in the block diagram applies XOR logic to the partial products and the carries
generated in each column. The output of this stage gives the final 16-bit product, which is obtained
in parallel rather than sequentially.
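
The three stages of the 4-bit block can be checked with a short Python model. This is a sketch under stated assumptions rather than the authors' implementation: the partial product numbering pp[4*i + j + 1] = A[j] & B[i] and the exact XOR expression for each product bit are not given in the text and are inferred here, and the helpers `inv`, `maj` and `mixed` are hypothetical names for logically equivalent rewrites of the terms in equations (3) to (20).

```python
def inv(x):
    """Complement of a single bit (the ~ of the carry equations)."""
    return x ^ 1


def maj(a, b, c):
    """1 when at least two of the three bits are 1."""
    return (a & b) | (c & (a | b))


def mixed(a, b, c):
    """1 when the three bits are not all equal."""
    return (a & inv(b)) | (c & inv(a)) | (inv(c) & b)


def mul4_precomputed(a, b):
    """Multiply two 4-bit values using the pre-computed carry equations."""
    # Assumed numbering: p[4*i + j + 1] = A[j] & B[i]; p[0] is unused padding.
    p = [0] + [((a >> j) & 1) & ((b >> i) & 1)
               for i in range(4) for j in range(4)]

    c2 = p[5] & p[2]                                            # (3)
    c3t1 = maj(p[6], p[3], p[9])                                # (4)
    c3t2 = mixed(p[9], p[6], p[3])                              # (5)
    c31 = c3t2 if c2 else c3t1                                  # (6)
    c32 = p[2] & p[5] & p[3] & p[6] & p[9]                      # (7)
    c41t1 = (mixed(p[10], p[7], p[4]) if p[13]
             else maj(p[7], p[4], p[10]))                       # (8)
    c41t2 = (maj(inv(p[7]), inv(p[4]), inv(p[10])) if p[13]
             else mixed(p[10], p[7], p[4]))                     # (9)
    c41 = c41t2 if c31 else c41t1                               # (10)
    c42 = ((c31 & p[13] & maj(p[10], p[7], p[4]))
           | (p[10] & p[7] & p[4] & (c31 | p[13])))             # (11)
    c51t1 = (mixed(p[14], p[11], p[8]) if c32
             else maj(p[11], p[8], p[14]))                      # (12)
    c51t2 = (maj(inv(p[11]), inv(p[8]), inv(p[14])) if c32
             else mixed(p[14], p[11], p[8]))                    # (13)
    c51 = c51t2 if c41 else c51t1                               # (14)
    c52 = ((c41 & c32 & maj(p[14], p[11], p[8]))
           | (p[14] & p[11] & p[8] & (c41 | c32)))              # (15)
    c6t1 = maj(p[12], p[15], c42)                               # (16)
    c6t2 = mixed(c42, p[15], p[12])                             # (17)
    c61 = c6t2 if c51 else c6t1                                 # (18)
    c62 = c51 & c42 & p[12] & p[15]                             # (19)
    c71 = maj(c52, p[16], c61)                                  # (20)

    # Assumed XOR stage: each product bit is the parity of its column inputs.
    bits = [
        p[1],                                   # P0
        p[2] ^ p[5],                            # P1
        p[3] ^ p[6] ^ p[9] ^ c2,                # P2
        p[4] ^ p[7] ^ p[10] ^ p[13] ^ c31,      # P3
        p[8] ^ p[11] ^ p[14] ^ c41 ^ c32,       # P4
        p[12] ^ p[15] ^ c51 ^ c42,              # P5
        p[16] ^ c61 ^ c52,                      # P6
        c71 ^ c62,                              # P7
    ]
    return sum(bit << k for k, bit in enumerate(bits))
```

Each column produces at most two carries, one of weight 2 and one of weight 4, which is why the names come in pairs such as c31 and c32. The multiplexer form (select between two pre-computed candidates on the incoming carry) is what lets the candidates be evaluated before the incoming carry is known.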


2.2. Carry pre-computation based binary multiplier using operand decomposition 

In operand decomposition [12], the operands X and Y are decomposed into four numbers A, B, C and D to
reduce the number of ones in the partial products. The operands are decomposed in such a way that the
decomposed operands contain more zeros than ones. As the number of zeros is larger, the switching
activity of the circuit is reduced, which in turn reduces the dynamic power consumption of the
architecture.
Assume that the two operands X and Y have n bits,


X = [X(n-1) X(n-2) ... X1 X0], and
Y = [Y(n-1) Y(n-2) ... Y1 Y0] (21)


The four decomposed operands are given by


A = ~X ∧ ~Y,
B = X ∧ Y,
C = ~X ∧ Y, and
D = X ∧ ~Y (22)


where ∧ is the AND operation and ~ is the two's complement.
The final product is determined using equation (23).


X*Y = (C * D) - (A * B); (23)


The products C*D and A*B are determined using the 8-bit carry pre-computation based multiplier. The
final partial sums and carries from both products can then be combined using a carry save adder and
a carry look-ahead adder. The block diagram of this multiplier is shown in Figure 5.
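
To illustrate the idea of operand decomposition, the Python sketch below checks one widely used decomposition identity, X*Y = (X ∧ ~Y)*(~X ∧ Y) + (X ∧ Y)*(X ∨ Y), by exhaustive enumeration over 8-bit operands. The lowercase names a, b, c, d, the use of ~ as a bitwise NOT and the identity itself are assumptions of this sketch; they may not map one-to-one onto the operands and the expression in equations (22) and (23).

```python
def decompose(x, y, n=8):
    """Split two n-bit operands into four sparser operands (assumed form)."""
    mask = (1 << n) - 1
    a = x & ~y & mask   # bits set only in x
    b = x & y           # bits set in both x and y
    c = ~x & y & mask   # bits set only in y
    d = x | y           # bits set in x or y
    return a, b, c, d


def mul_decomposed(x, y, n=8):
    """Product rebuilt from two smaller-activity products: a*c + b*d."""
    a, b, c, d = decompose(x, y, n)
    return a * c + b * d


def ones(v):
    """Number of 1 bits in v."""
    return bin(v).count("1")
```

The identity holds because x = a + b and y = b + c with a, b and c having no set bits in common, so x*y = a*c + b*(a + b + c) and a + b + c equals x | y. Since a and b split the ones of x between them, each carries at most as many ones as x, which is the property that lowers switching activity in the partial products.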


3. RESULTS AND ANALYSIS 

The proposed architecture is modeled using Verilog HDL, simulated using Cadence NCSIM and
synthesized using Cadence RTL Compiler with a 65nm TSMC library. The different implementation
methodologies were implemented in the same technological environment and their performance
parameters were then compared. For the comparison, the designs were taken from the references,
simulated, and their performance parameters computed using the same MOSFET technology file. Input
data was applied in a regular fashion for experimental purposes. The delay and power were measured
using the worst-case pattern, at the output where the delay is maximum.

It is observed that the proposed carry pre-computation based multiplier and the carry
pre-computation based multiplier with operand decomposition offer a substantial reduction in
propagation delay and total power consumption. From Table 1 and Table 2, it can be observed that the
proposed carry pre-computation based multiplier reduces the power-delay product (8-bit values) by
~23%, ~64%, ~57%, ~83% and ~94% when compared with the array, Wallace, column based, Nikhilam based
and compressor based multipliers respectively, and that the carry pre-computation based multiplier
with operand decomposition reduces it by ~41%, ~72%, ~67%, ~87% and ~95% when compared with the same
multipliers respectively.
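
The quoted percentages can be reproduced from the 8-bit power-delay products of Table 1 with a few lines of Python. The values are copied from the table; truncating rather than rounding the percentage is an assumption made here because it matches the quoted figures, and the names are illustrative.

```python
# Power-delay products of the 8-bit reference designs, copied from Table 1.
REFERENCE_PDP = {
    "array": 31.63,
    "wallace": 67.42,
    "column": 57.6,
    "nikhilam": 149.95,
    "compressor": 410.92,
}
PROPOSED_PDP = 24.23      # carry pre-computation based multiplier
PROPOSED_OD_PDP = 18.5    # with operand decomposition


def reduction_percent(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


def reductions(proposed):
    """Truncated percentage reduction against every reference design."""
    return {name: int(reduction_percent(ref, proposed))
            for name, ref in REFERENCE_PDP.items()}
```

For example, `reductions(PROPOSED_PDP)["array"]` evaluates (31.63 - 24.23) / 31.63, which is about 23%.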
Figure 5. Carry Pre-Computation Based Multiplier Using Operand Decomposition 
Table 1. Summary of Synthesis Results of 8-Bit Multiplier Architectures 

S.No.  Architecture                                                  Delay  Dynamic     Static      Total       Power-Delay
                                                                     (ns)   Power (uW)  Power (uW)  Power (uW)  Product (pJ)
1      Array Based Multiplier [6]                                    1.5    15.09       6           21.09       31.63
2      Wallace Based Multiplier [2]                                  1.2    6.27        49.913      56.184      67.42
3      Column Based Multiplier [9]                                   1.95   26.74       2.8         29.54       57.6
4      Nikhilam Based Multiplier [10]                                3.2    42.56       4.3         46.86       149.95
5      Compressor Based Multiplier [11]                              4.02   95.2        6.79        101.99      410.92
6      Pre-Computation Based Multiplier                              0.75   25.77       7.46        33.23       24.23
7      Pre-Computation Based Multiplier with Operand Decomposition   1.02   3.36        14.808      18.172      18.5

Table 2. Summary of Synthesis Results of 16-Bit Multiplier Architectures 

S.No.  Architecture                                                  Delay  Dynamic     Static      Total       Power-Delay
                                                                     (ns)   Power (uW)  Power (uW)  Power (uW)  Product (pJ)
1      Array Based Multiplier [6]                                    2.89   30.18       12          42.18       121.90
2      Wallace Based Multiplier [2]                                  2.46   12.54       99.826      112.366     276.42
3      Column Based Multiplier [9]                                   3.82   52.48       5.4         57.88       221.10
4      Nikhilam Based Multiplier [10]                                5.96   80.65       8.1         88.75       528.95
5      Compressor Based Multiplier [11]                              8.04   190.4       13.58       203.98      1639.99
6      Pre-Computation Based Multiplier                              1.4    51.54       14.9        66.44       93.016
7      Pre-Computation Based Multiplier with Operand Decomposition   1.96   6.72        29.616      36.336      71.218





From Table 1 and Table 2, it can be observed that the carry pre-computation based multiplier with
operand decomposition consumes less power than the carry pre-computation based multiplier, at the
cost of increased delay. The proposed carry pre-computation based multiplier with operand
decomposition gives a better power-delay product than both the proposed carry pre-computation based
multiplier and the existing multipliers from the literature.
4. CONCLUSION 

In this paper, a Vedic mathematics based multiplier has been proposed which uses carry
pre-computation and operand decomposition. The proposed architecture combines the benefits of the
Vedic method, parallel pre-computation of carries and operand decomposition, thereby reducing the
power-delay product. The propagation delay of the carry pre-computation based multiplier for 8-bit
and 16-bit multiplication was 0.75 ns and 1.4 ns, while the power consumption was 33.23 uW and
66.44 uW. The propagation delay of the carry pre-computation based multiplier with operand
decomposition for 8-bit and 16-bit multiplication was 1.02 ns and 1.96 ns, while the power
consumption was 18.17 uW and 36.34 uW. With operand decomposition, the delay of 8-bit multiplication
was decreased by ~68% and the power consumption was reduced by ~61% when compared to the Nikhilam
based Vedic multiplier.